Problem Statement¶
Context¶
AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.
Objective¶
To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target.
Data Dictionary¶
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIPCode: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
Importing necessary libraries¶
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
Note:
After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.
On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the above code ensures that all necessary libraries and their dependencies are installed to successfully execute the code in this notebook.
# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# to split the data into train and test sets
from sklearn.model_selection import train_test_split
from sklearn import metrics
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    classification_report,
    roc_auc_score,
    roc_curve,
)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.tree import export_text
# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# to suppress unnecessary warnings
import warnings
warnings.filterwarnings("ignore")
Loading the dataset¶
# run the following lines if using Google Colab
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# Write your code here to read the data
data = pd.read_csv('/content/drive/MyDrive/Great Learning/Machine Learning/Project 02/Loan_Modelling.csv')
original_data = data.copy()
Data Overview¶
- Observations
- Sanity checks
Checking the first and last 5 rows of dataset¶
# checking five top rows of the dataset
data.head(5)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
# checking five bottom rows of the dataset
data.tail(5)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
Understand the shape of the dataset¶
# checking the shape of the data (how many rows and columns are present in the dataset)
data.shape
(5000, 14)
- The dataset has 5000 rows and 14 columns.
# Check for missing values in each column
data.isnull().sum()
| 0 | |
|---|---|
| ID | 0 |
| Age | 0 |
| Experience | 0 |
| Income | 0 |
| ZIPCode | 0 |
| Family | 0 |
| CCAvg | 0 |
| Education | 0 |
| Mortgage | 0 |
| Personal_Loan | 0 |
| Securities_Account | 0 |
| CD_Account | 0 |
| Online | 0 |
| CreditCard | 0 |
- There are no missing values in the dataset.
Checking the data types and non-null Counts¶
# Checking the data types of the columns in dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
- All the columns are numerical with 13 int64 and 1 float64 variables in the data.
Checking for Null Values¶
data.isnull().values.any()
False
- There are no null/NaN values in the data.
Checking for duplicate values¶
# checking for duplicate values
data.duplicated().sum()
0
- There are no duplicate values in the data.
Checking the Statistical Summary¶
# Checking the statistical summary
data.describe(include='all').T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
- The average age of customers is ~45 years.
- On average, customers spend ~1,940 USD on credit cards per month.
- The average mortgage value is ~56,500 USD, with the highest being 635,000 USD.
- Experience has a minimum of -3 years, which is anomalous and will need correction during preprocessing.
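The skew visible in the summary table can be checked numerically. A minimal sketch on a toy frame (the values below are illustrative, not taken from Loan_Modelling.csv); in the notebook, the same calls can be run on `data` directly:

```python
import pandas as pd

# Toy frame mirroring the dataset's skewed columns (illustrative values only)
toy = pd.DataFrame({
    "Income": [8, 39, 64, 98, 224, 45, 60, 70, 180, 30],
    "Mortgage": [0, 0, 0, 0, 101, 0, 0, 635, 90, 0],
})

# Positive skewness confirms the long right tails seen in the histograms
skew = toy[["Income", "Mortgage"]].skew()
print(skew)

# Share of zero-mortgage customers explains the spike at zero
zero_share = (toy["Mortgage"] == 0).mean()
print(f"Zero-mortgage share: {zero_share:.0%}")
```

On the full dataset, `data[["Income", "Mortgage"]].skew()` quantifies the right skew that the histograms only show visually.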
Check Uniqueness of the dataset¶
data.nunique()
| 0 | |
|---|---|
| ID | 5000 |
| Age | 45 |
| Experience | 47 |
| Income | 162 |
| ZIPCode | 467 |
| Family | 4 |
| CCAvg | 108 |
| Education | 3 |
| Mortgage | 347 |
| Personal_Loan | 2 |
| Securities_Account | 2 |
| CD_Account | 2 |
| Online | 2 |
| CreditCard | 2 |
Checking the unique entries in Personal_Loan column¶
print(data['Personal_Loan'].unique())
[0 1]
- Personal_Loan contains only two values, 0 and 1, confirming it is a binary target.
Dropping column¶
# Dropping ID column from the data
data.drop('ID', axis=1, inplace=True)
Exploratory Data Analysis¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Univariate Analysis¶
# Define the list of numerical features
num_features = ['Age', 'Income', 'Mortgage']
# Set up the figure with subplots: 1 row, 3 columns
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))
# Loop through each feature and corresponding subplot axis
colors = ['lightgreen', 'skyblue', 'orange']
for i, feature in enumerate(num_features):
    sns.histplot(data=data, x=feature, ax=axes[i], color=colors[i])
    axes[i].set_title(f"Histogram of {feature}", fontsize=14, fontweight='bold', color='navy')
    axes[i].set_xlabel(f"{feature} (units)", fontsize=12)
    axes[i].set_ylabel("Frequency", fontsize=12)
    axes[i].tick_params(axis='y', left=False)
plt.tight_layout()
plt.show()
- Age exhibits a normal distribution with some mild fluctuations.
- Income and Mortgage are right-skewed. However, the mortgage data indicates that most individuals have little to no mortgage, with a significant concentration at zero.
- These patterns could indicate a population with moderate earnings and limited financial obligations in terms of property loans.
cols_to_plot = ['CreditCard', 'Family', 'Personal_Loan'] # Reordered
y_labels = {
'CreditCard': 'Credit Card Ownership',
'Family': 'Family Size',
'Personal_Loan': 'Loan'
}
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(20, 6)) # 3 columns
for ax, col in zip(axes, cols_to_plot):
    top_vals = data[col].value_counts().nlargest(20)
    labels = top_vals.index.astype(str)
    counts = top_vals.values
    # Create a DataFrame for seaborn compatibility
    plot_df = pd.DataFrame({col: labels, 'Count': counts})
    sns.barplot(
        data=plot_df,
        x='Count',
        y=col,
        hue=col,
        palette='Set2',
        dodge=False,
        legend=False,
        ax=ax
    )
    ax.set_title(f"Distribution of {y_labels[col]}", fontsize=14, fontweight='bold', color='navy')
    ax.set_xlabel("Total Number", fontsize=12)
    ax.set_ylabel(y_labels[col], fontsize=12)
plt.tight_layout()
plt.show()
- Single-member families form the largest segment (~30% of customers), while 3-member families are the smallest group (~20%).
- A significant majority (~90%) did not accept the personal loan offered in the last campaign, consistent with the ~9.6% conversion rate in the summary statistics.
- Overall, smaller households dominate the dataset, and loan acceptance is a rare event, which should be kept in mind when evaluating model performance.
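The percentage claims above can be reproduced with `value_counts(normalize=True)`. A small sketch on hypothetical labels (in the notebook, run the same calls on `data["Family"]` and `data["Personal_Loan"]`):

```python
import pandas as pd

# Hypothetical mini-sample, structured like the real columns
toy = pd.DataFrame({
    "Family": [1, 1, 1, 2, 2, 3, 4, 4, 2, 1],
    "Personal_Loan": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Proportions behind the bar charts: family-size mix and loan take-up rate
family_share = toy["Family"].value_counts(normalize=True)
loan_rate = toy["Personal_Loan"].mean()
print(family_share)
print(f"Loan acceptance rate: {loan_rate:.1%}")
```

Because `Personal_Loan` is 0/1, its mean is exactly the acceptance rate, so no separate count is needed.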
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))
# Boxplot for Income
sns.boxplot(x=data['Income'], color='skyblue', ax=axes[0])
axes[0].set_title("Boxplot of Income", fontsize=14, fontweight='bold', color='navy')
axes[0].set_xlabel("Income ('000 USD)", fontsize=12)
axes[0].set_yticks([])
# Boxplot for Mortgage
sns.boxplot(x=data['Mortgage'], color='orange', ax=axes[1])
axes[1].set_title("Boxplot of Mortgage", fontsize=14, fontweight='bold', color='navy')
axes[1].set_xlabel("Mortgage ('000 USD)", fontsize=12)
axes[1].set_yticks([])
plt.tight_layout()
plt.show()
- Both income and mortgage distributions are positively skewed, with longer right tails indicating the presence of high-value outliers.
- Income has a moderate number of high outliers above ~180,000 USD, while Mortgage has a very large number of outliers, especially beyond ~300,000 USD.
- Income inequality is moderate, but mortgage disparity is high.
- The concentration of zeros in mortgage data reflects limited borrowing or widespread absence of home loans in the population.
# Histograms for Age and Income side by side
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))
# Histogram for Age
sns.histplot(data=data, x='Age', bins=20, color='teal', kde=True, ax=axes[0])
axes[0].set_title('Distribution of Age', fontsize=14, fontweight='bold', color='navy')
axes[0].set_xlabel('Age (in years)')
axes[0].set_ylabel('Count')
axes[0].set_ylim(0, 400)
# Histogram for Income
sns.histplot(data=data, x='Income', bins=20, color='steelblue', kde=True, ax=axes[1])
axes[1].set_title('Distribution of Income', fontsize=14, fontweight='bold', color='navy')
axes[1].set_xlabel("Income ('000 USD)")
axes[1].set_ylabel('Count')
axes[1].set_ylim(0, 400)
plt.tight_layout()
plt.show()
- Income is heavily skewed, with the majority (over 75% of the population) earning less than 100,000 USD and a long right tail extending beyond 200,000 USD.
- The peak income range falls between 40,000 and 80,000 USD, where the highest concentration (~400 individuals per bin) is observed.
# Selected columns and custom labels
cols_to_plot = ['Mortgage', 'CCAvg']
x_labels = {'Mortgage': 'Mortgage (000 USD)', 'CCAvg': 'Avg Credit Card Spend (000 USD)'}
# Create side-by-side KDE plots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))
for ax, col in zip(axes, cols_to_plot):
    sns.kdeplot(data[col], fill=True, color='teal', ax=ax)
    ax.set_title(f"Density Plot of {col}", fontweight='bold', color='navy')  # only the column name
    ax.set_xlabel(x_labels[col])  # '000 USD' shown only here
    ax.set_ylabel("Density")
plt.tight_layout()
plt.show()
- A large portion of individuals (~65-70%) have zero mortgage, highlighting low mortgage penetration.
- Credit card spending is modest for most, with the majority spending around 1,000-1,500 USD per month.
- Outliers exist for both variables, with mortgages reaching 635,000 USD and credit card spending up to ~10,000 USD, though they represent a small fraction (<5%) of the population.
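The "<5%" tail claim can be checked by measuring the share of observations beyond a cutoff. A sketch on an illustrative series, with 300 (thousand USD) as an assumed threshold; in the notebook, apply the same check to `data["Mortgage"]`:

```python
import pandas as pd

# Illustrative mortgage values (in thousand USD), not the real data
mortgage = pd.Series([0] * 13 + [120, 250, 310, 480, 635, 90, 55])

# Share of customers beyond an assumed high-value cutoff of 300
tail_share = (mortgage > 300).mean()
print(f"Share above 300K: {tail_share:.0%}")
```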
Bivariate Analysis¶
# To determine the correlation matrix
data.corr()
| Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1.000000 | 0.994215 | -0.055269 | -0.030530 | -0.046418 | -0.052012 | 0.041334 | -0.012539 | -0.007726 | -0.000436 | 0.008043 | 0.013702 | 0.007681 |
| Experience | 0.994215 | 1.000000 | -0.046574 | -0.030456 | -0.052563 | -0.050077 | 0.013152 | -0.010582 | -0.007413 | -0.001232 | 0.010353 | 0.013898 | 0.008967 |
| Income | -0.055269 | -0.046574 | 1.000000 | -0.030709 | -0.157501 | 0.645984 | -0.187524 | 0.206806 | 0.502462 | -0.002616 | 0.169738 | 0.014206 | -0.002385 |
| ZIPCode | -0.030530 | -0.030456 | -0.030709 | 1.000000 | 0.027512 | -0.012188 | -0.008266 | 0.003614 | -0.002974 | 0.002422 | 0.021671 | 0.028317 | 0.024033 |
| Family | -0.046418 | -0.052563 | -0.157501 | 0.027512 | 1.000000 | -0.109275 | 0.064929 | -0.020445 | 0.061367 | 0.019994 | 0.014110 | 0.010354 | 0.011588 |
| CCAvg | -0.052012 | -0.050077 | 0.645984 | -0.012188 | -0.109275 | 1.000000 | -0.136124 | 0.109905 | 0.366889 | 0.015086 | 0.136534 | -0.003611 | -0.006689 |
| Education | 0.041334 | 0.013152 | -0.187524 | -0.008266 | 0.064929 | -0.136124 | 1.000000 | -0.033327 | 0.136722 | -0.010812 | 0.013934 | -0.015004 | -0.011014 |
| Mortgage | -0.012539 | -0.010582 | 0.206806 | 0.003614 | -0.020445 | 0.109905 | -0.033327 | 1.000000 | 0.142095 | -0.005411 | 0.089311 | -0.005995 | -0.007231 |
| Personal_Loan | -0.007726 | -0.007413 | 0.502462 | -0.002974 | 0.061367 | 0.366889 | 0.136722 | 0.142095 | 1.000000 | 0.021954 | 0.316355 | 0.006278 | 0.002802 |
| Securities_Account | -0.000436 | -0.001232 | -0.002616 | 0.002422 | 0.019994 | 0.015086 | -0.010812 | -0.005411 | 0.021954 | 1.000000 | 0.317034 | 0.012627 | -0.015028 |
| CD_Account | 0.008043 | 0.010353 | 0.169738 | 0.021671 | 0.014110 | 0.136534 | 0.013934 | 0.089311 | 0.316355 | 0.317034 | 1.000000 | 0.175880 | 0.278644 |
| Online | 0.013702 | 0.013898 | 0.014206 | 0.028317 | 0.010354 | -0.003611 | -0.015004 | -0.005995 | 0.006278 | 0.012627 | 0.175880 | 1.000000 | 0.004210 |
| CreditCard | 0.007681 | 0.008967 | -0.002385 | 0.024033 | 0.011588 | -0.006689 | -0.011014 | -0.007231 | 0.002802 | -0.015028 | 0.278644 | 0.004210 | 1.000000 |
plt.figure(figsize=(14, 8))
data = data.drop(columns=['ID'], errors='ignore') # Drop the 'ID' column
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
- Age vs Experience (0.99), Income vs CCAvg (0.65), and Income vs Personal_Loan (0.50) exhibit strong positive correlations.
- Mortgage vs Income (0.21), CD_Account vs Personal_Loan (0.32), CD_Account vs Securities_Account (0.32), and CreditCard vs CD_Account (0.28) exhibit moderate positive correlations.
- Income vs Education (-0.19) exhibits a weak negative correlation, while Personal_Loan vs Education (0.14) exhibits a weak positive one.
- The most meaningful predictors of financial behavior (such as credit card spending and loan uptake) are Income and Age/Experience.
- Variables such as ZIPCode, Family, and Online show very weak relationships, suggesting they have little influence on other attributes on their own.
- The correlation between Age and Experience is almost perfect (0.99), so one of the two should be dropped before modeling to avoid multicollinearity.
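Since `variance_inflation_factor` is already imported at the top of the notebook, the Age/Experience redundancy can be quantified with variance inflation factors. A sketch on synthetic, deliberately collinear columns (the real check would run on the notebook's feature matrix `X`):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic design matrix mimicking the Age/Experience relationship
rng = np.random.default_rng(0)
age = rng.integers(23, 67, size=200).astype(float)
X = pd.DataFrame({
    "const": 1.0,                                       # intercept so VIFs are interpretable
    "Age": age,
    "Experience": age - 21 + rng.normal(0, 0.5, 200),   # ~Age shifted plus small noise
    "Income": rng.normal(74, 46, 200),                  # independent column for contrast
})

# VIF > 10 is a common rule of thumb for problematic multicollinearity
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)
```

Age and Experience show very large VIFs while the independent Income column stays near 1, mirroring what dropping one of the pair would fix.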
plt.figure(figsize=(17, 7))
plt.scatter(
data['Income'],
data['Mortgage'],
s=data['Income'], # Bubble size based on Income
c=data['Income'], # Bubble color based on Income
cmap='viridis',
alpha=0.6,
edgecolors='w',
linewidth=0.5
)
plt.colorbar(label='Income (000 USD)')
plt.title("Bubble Chart: Income vs. Mortgage", fontsize=16)
plt.xlabel('Income (000 USD)', fontsize=14)
plt.ylabel('Mortgage (000 USD)', fontsize=14)
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()
- There is a positive relationship between Income and Mortgage.
- As Income increases, the Mortgage amount generally increases as well, though not in a perfectly linear way.
- The bubble size and color intensity both increase with Income, visually reinforcing the upward trend.
- The chart confirms that higher income is typically associated with higher mortgage amounts, though this is not absolute: some high earners still have no mortgage.
- The presence of zero-mortgage individuals across all income levels may point to diverse financial strategies (e.g., renting, early payoff, or inherited properties).
- This distribution can be useful for financial institutions when profiling clients for loan offerings or assessing mortgage risk tiers.
# Copy and apply label mappings
data_plot = data.copy()
education_labels = {1: "Undergrad", 2: "Graduate", 3: "Advanced/Professional"}
loan_labels = {0: "Not Accepted", 1: "Accepted"}
binary_labels = {0: "Don't Have", 1: "Have"}
data_plot["Education"] = data_plot["Education"].map(education_labels)
data_plot["Personal_Loan"] = data_plot["Personal_Loan"].map(loan_labels)
data_plot["Securities_Account"] = data_plot["Securities_Account"].map(binary_labels)
data_plot["CD_Account"] = data_plot["CD_Account"].map(binary_labels)
data_plot["CreditCard"] = data_plot["CreditCard"].map(binary_labels)
# Bin continuous variables
data_plot["Age"] = pd.cut(data_plot["Age"], bins=5)
data_plot["Experience"] = pd.cut(data_plot["Experience"], bins=5)
data_plot["Income"] = pd.cut(data_plot["Income"], bins=5)
data_plot["CCAvg"] = pd.cut(data_plot["CCAvg"], bins=5)
data_plot["Mortgage"] = pd.cut(data_plot["Mortgage"], bins=5)
# Define feature-label pairs
features = [
("Age", "Age Group"),
("Experience", "Experience (Years)"),
("Income", "Annual Income (K$)"),
("Family", "Family Size"),
("CCAvg", "Credit Card Spend (K$)"),
("Education", "Education Level"),
("Mortgage", "Mortgage Value (K$)"),
("Securities_Account", "Securities Account"),
("CD_Account", "CD Account"),
("CreditCard", "Credit Card")
]
# Softer color options
soft_palettes = ['Pastel1', 'Set3', 'Accent', 'Pastel2', 'Set2']
# Create subplots
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(14, 25))
axes = axes.flatten()
for i, (feature, xlabel) in enumerate(features):
    ctab = pd.crosstab(data_plot[feature], data_plot["Personal_Loan"])
    ctab.plot(kind='bar', stacked=True, colormap=soft_palettes[i % len(soft_palettes)], ax=axes[i])
    axes[i].set_title(f"Personal Loan by {xlabel}", fontsize=13)
    axes[i].set_xlabel(xlabel, fontsize=11)
    axes[i].set_ylabel("Number of Customers", fontsize=11)
    axes[i].legend(title="Loan Status")
    axes[i].tick_params(axis='x', rotation=30)
plt.tight_layout()
plt.show()
- Personal Loan by Credit Card Spend: Higher credit card spending appears to correlate with more loan acceptances. Also, lower spending groups have fewer accepted personal loans.
- Personal Loan by Education Level: Advanced/Professional education holders tend to have the highest number of accepted loans. Undergraduate education levels show fewer accepted loans compared to graduates.
- Personal Loan by Securities Account: Customers with securities accounts have a higher acceptance rate for personal loans. Those without securities accounts receive fewer accepted loans.
- Personal Loan by Mortgage Value: Loan acceptances increase with higher mortgage value brackets. Customers in lower mortgage value categories have fewer loan approvals.
- Personal Loan by CD Account: Customers with CD accounts tend to have more accepted personal loans. Not having a CD account correlates with lower loan approvals.
- Personal Loan by Credit Card Ownership: Owning a credit card seems to be linked with higher personal loan acceptance rates. Those without credit cards show lower loan acceptance numbers.
- Personal Loan by Age Group: Middle-aged groups (31-49 years) receive more accepted loans. Younger and older age groups have lower acceptance rates.
- Personal Loan by Experience: Loan acceptance increases with experience up to a point, then stabilizes. Those with minimal experience have fewer approved personal loans.
- Personal Loan by Annual Income: Higher annual income groups receive more accepted loans. Those with lower incomes have fewer loan approvals.
- Personal Loan by Family Size: Family sizes of 1 or 2 see the highest loan acceptances. Larger family sizes have fewer approved personal loans.
# Include 'Personal_Loan' in the data for hue grouping
sns.pairplot(data[['Age', 'Experience', 'Income', 'Mortgage', 'Personal_Loan']], hue='Personal_Loan')
# Add title
plt.suptitle('Pair Plot of Continuous Variables', y=1.02)
plt.show()
- Income is the most significant factor influencing personal loan acceptance.
- Education level matters: more educated customers are more likely to accept a loan.
- Mortgage value is moderately correlated with loan acceptance.
- Family size has limited influence on loan acceptance.
- Age and experience do not strongly predict loan behavior.
- Together, these visualizations suggest that targeting higher-income, well-educated customers may be more effective for personal loan offers.
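The targeting insight can be made concrete as an acceptance rate per education level using a groupby mean. Toy labels for illustration; in the notebook, the same call on `data` (before `Personal_Loan` is cast to category) yields the real rates:

```python
import pandas as pd

# Hypothetical mini-sample with the same column semantics
toy = pd.DataFrame({
    "Education": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "Personal_Loan": [0, 0, 0, 0, 1, 0, 1, 0, 1, 0],
})

# Mean of a 0/1 target per group is the acceptance rate for that group
rate_by_edu = toy.groupby("Education")["Personal_Loan"].mean()
print(rate_by_edu)
```

Rates, rather than raw counts, avoid being misled by groups that are simply larger.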
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Boxplot 1: Income vs Personal_Loan
sns.boxplot(x='Personal_Loan', y='Income', data=data, ax=axes[0])
axes[0].set_title("Income by Personal Loan Status")
axes[0].set_xlabel("Personal Loan")
axes[0].set_ylabel("Income (000 USD)")
# Boxplot 2: Income vs CreditCard (substituted from CD_Account)
sns.boxplot(x='CreditCard', y='Income', data=data, ax=axes[1])
axes[1].set_title("Income by Credit Card Ownership")
axes[1].set_xlabel("Credit Card")
axes[1].set_ylabel("Income (000 USD)")
plt.tight_layout()
plt.show()
# Second pair of plots — keep CD_Account and Securities_Account
binary_cols = ['CD_Account', 'Securities_Account']
numeric_col = 'Income'
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, col in zip(axes, binary_cols):
    sns.boxplot(x=col, y=numeric_col, data=data, ax=ax)
    ax.set_title(f'{numeric_col} by {col}')
    ax.set_xlabel(col)
    ax.set_ylabel('Income (000 USD)')
plt.tight_layout()
plt.show()
- Customers with a certificate of deposit (CD) account show a significantly higher median income (around 120,000-130,000 USD) than those without (around 60,000-70,000 USD).
- Individuals with a securities account have a marginally higher median income than those without.
- Income variability is comparable whether or not someone has a securities account.
- Both securities account groups exhibit a similar pattern of high-income outliers.
- Having a CD account appears to be a stronger marker of higher income than having a securities account.
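The median comparison behind the boxplots reduces to a one-line groupby. Illustrative values below; in the notebook, `data.groupby("CD_Account")["Income"].median()` gives the actual figures:

```python
import pandas as pd

# Hypothetical sample mirroring the CD_Account / Income relationship
toy = pd.DataFrame({
    "CD_Account": [0, 0, 0, 0, 1, 1, 0, 1],
    "Income": [45, 60, 70, 55, 120, 135, 64, 125],
})

# Median income per account-holding group
median_income = toy.groupby("CD_Account")["Income"].median()
print(median_income)
```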
# Map numeric education levels to labels
education_labels = {1: 'Undergrad', 2: 'Graduate', 3: 'Advanced/Professional'}
data['Education_Label'] = data['Education'].map(education_labels)
# Violin plot with descriptive labels
plt.figure(figsize=(14, 7))
sns.violinplot(x='Education_Label', y='Income', data=data)
plt.title('Income Distribution by Education Level', fontsize=14, fontweight='bold', color='navy')
plt.xlabel('Education Level', fontsize=12)
plt.ylabel('Income (000 USD)', fontsize=12)
plt.show()
data.drop(columns=['Education_Label'], inplace=True)
Questions:
- What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
- There is a large number of outliers, especially beyond the 300,000 USD mark.
- How many customers have credit cards?
- 1470 customers have credit cards.
- What are the attributes that have a strong correlation with the target attribute (personal loan)?
- Income (0.50) and average credit card spending (CCAvg) (0.37) show moderate positive correlations with the likelihood of taking a personal loan.
- How does a customer's interest in purchasing a loan vary with their age?
- Based on the heatmap visual, the correlation coefficient between 'Age' and 'Personal_Loan' is -0.01. This very weak negative correlation suggests that there is virtually no linear relationship between a customer's age and their likelihood of purchasing a personal loan in this dataset. In other words, age, by itself, is not a strong predictor of whether someone will take out a loan.
- How does a customer's interest in purchasing a loan vary with their education?
- The correlation coefficient between 'Education' and 'Personal Loan' is 0.14. This indicates a weak positive correlation. This suggests a slight tendency for customers with higher levels of education to be more inclined to purchase a personal loan, although the relationship is not very strong.
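The correlation figures quoted in the answers above come from the correlation matrix; ranking one column of it makes the comparison explicit. A sketch on a toy numeric frame (in the notebook, use `data.corr()["Personal_Loan"]` before the target is cast to category):

```python
import pandas as pd

# Toy numeric frame; loan uptake tracks income far more than age here
toy = pd.DataFrame({
    "Income": [40, 60, 80, 150, 170, 45, 90, 160],
    "Age": [30, 55, 42, 38, 61, 25, 47, 52],
    "Personal_Loan": [0, 0, 0, 1, 1, 0, 0, 1],
})

# One column of the correlation matrix, sorted to rank candidate predictors
corr_with_target = (
    toy.corr()["Personal_Loan"]
    .drop("Personal_Loan")
    .sort_values(ascending=False)
)
print(corr_with_target)
```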
# Count of customers who have credit cards (CreditCard = 1)
num_with_creditcard = data[data['CreditCard'] == 1].shape[0]
print(f"Number of customers with credit cards: {num_with_creditcard}")
Number of customers with credit cards: 1470
Data Preprocessing¶
Missing Value Treatment¶
Outlier/Missing Value/Constant Column Detection¶
# Detect missing values
print(" Missing Values:")
print(data.isnull().sum()[data.isnull().sum() > 0])
# Detect zero or negative values (optional for certain features)
print("\n Zero or Negative Values (for numeric cols only):")
for col in data.select_dtypes(include=np.number).columns:
    if (data[col] <= 0).any():
        print(f"{col}: {(data[col] <= 0).sum()}")
# Detect constant columns (no variation)
print("\n Constant Columns:")
constant_cols = [col for col in data.columns if data[col].nunique() == 1]
print(constant_cols)
# Detect outliers using IQR method
print("\n Outlier Detection (using IQR method):")
def detect_outliers_iqr(data):
    outlier_info = {}
    for col in data.select_dtypes(["float64", "int64"]).columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1  # interquartile range (75th percentile - 25th percentile)
        outliers = data[(data[col] < Q1 - 1.5 * IQR) | (data[col] > Q3 + 1.5 * IQR)]
        if not outliers.empty:
            outlier_info[col] = len(outliers)
    return outlier_info
outliers = detect_outliers_iqr(data)
print(outliers)
Missing Values:
Series([], dtype: int64)
Zero or Negative Values (for numeric cols only):
Experience: 118
CCAvg: 106
Mortgage: 3462
Personal_Loan: 4520
Securities_Account: 4478
CD_Account: 4698
Online: 2016
CreditCard: 3530
Constant Columns:
[]
Outlier Detection (using IQR method):
{'Income': 96, 'CCAvg': 324, 'Mortgage': 291, 'Personal_Loan': 480, 'Securities_Account': 522, 'CD_Account': 302}
Checking for Anomalous Values¶
# Loop through all columns and report whether negative values exist
for col in data.columns:
    if pd.api.types.is_numeric_dtype(data[col]):
        invalid_values = data[data[col] < 0][col].unique()
        if len(invalid_values) > 0:
            print(f"Column '{col}' has negative values: {invalid_values}")
        else:
            print(f"Column '{col}' has no negative values.")
Column 'Age' has no negative values.
Column 'Experience' has negative values: [-1 -2 -3]
Column 'Income' has no negative values.
Column 'ZIPCode' has no negative values.
Column 'Family' has no negative values.
Column 'CCAvg' has no negative values.
Column 'Education' has no negative values.
Column 'Mortgage' has no negative values.
Column 'Personal_Loan' has no negative values.
Column 'Securities_Account' has no negative values.
Column 'CD_Account' has no negative values.
Column 'Online' has no negative values.
Column 'CreditCard' has no negative values.
# Correcting the negative Experience values by taking their absolute value
# (-1, -2, and -3 become 1, 2, and 3 respectively)
data["Experience"] = data["Experience"].abs()
Visualizing Outliers Using Boxplot¶
# List of selected continuous columns to display Outliers
selected_cols = ['Age', 'Experience', 'Income', 'Family', 'Mortgage', 'CCAvg']
# Plot boxplots
plt.figure(figsize=(15, 10))
for i, variable in enumerate(selected_cols):
    plt.subplot(2, 3, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.title(variable)
plt.tight_layout()
plt.show()
Outlier Treatment¶
# functions to treat outliers by flooring and capping
def treat_outliers(data, col):
    Q1 = data[col].quantile(0.25)  # 25th percentile
    Q3 = data[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    # values below Lower_Whisker are floored to Lower_Whisker;
    # values above Upper_Whisker are capped at Upper_Whisker
    data[col] = np.clip(data[col], Lower_Whisker, Upper_Whisker)
    return data

def treat_outliers_all(data, col_list):
    for c in col_list:
        data = treat_outliers(data, c)
    return data
numerical_col = data.select_dtypes(include=np.number).columns.tolist()
data = treat_outliers_all(data, numerical_col)
# let's look at box plot to see if outliers have been treated or not
plt.figure(figsize=(15, 10))
for i, variable in enumerate(['Age', 'Experience', 'Income', 'Family', 'Mortgage', 'CCAvg']):
    plt.subplot(2, 3, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.title(variable)
plt.tight_layout()
plt.show()
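The flooring/capping rule applied by `treat_outliers` can be seen on a tiny series. A minimal sketch with one obvious outlier (toy numbers, same 1.5×IQR whiskers as above):

```python
import pandas as pd

# Toy series with one obvious outlier
s = pd.Series([1, 2, 3, 4, 5, 100])

# Same whisker rule as treat_outliers: 1.5 * IQR beyond the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip floors values below `lower` and caps values above `upper`
capped = s.clip(lower, upper)
print(capped.max())  # the outlier is pulled down to the upper whisker
```

Note that capping keeps the row (unlike dropping outliers), which preserves the sample size at the cost of distorting extreme values.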
Feature Engineering¶
# checking the number of uniques in the zip code
data["ZIPCode"].nunique()
467
data["ZIPCode"] = data["ZIPCode"].astype(str)
print(
    "Number of unique values if we take first two digits of ZIPCode: ",
    data["ZIPCode"].str[0:2].nunique(),
)
data["ZIPCode"] = data["ZIPCode"].str[0:2]
data["ZIPCode"] = data["ZIPCode"].astype("category")
Number of unique values if we take first two digits of ZIPCode: 7
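The transformation above (keeping only the first two ZIP digits to form coarse regions) can be illustrated on a few hypothetical ZIP codes:

```python
import pandas as pd

# Toy ZIP codes; slicing the first two digits groups them into coarse regions,
# mirroring the transformation applied above.
zips = pd.Series([90007, 92037, 94720, 90210]).astype(str)

prefix = zips.str[0:2]
print(prefix.tolist())   # ['90', '92', '94', '90']
print(prefix.nunique())  # 3
```

This collapses hundreds of dummy-variable candidates into a handful of region-level categories.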
## Converting the data type of categorical features to 'category'
cat_cols = [
    "Education",
    "Personal_Loan",
    "Securities_Account",
    "CD_Account",
    "Online",
    "CreditCard",
    "ZIPCode",
]
data[cat_cols] = data[cat_cols].astype("category")
# To check the updated data type of the entries
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Experience          5000 non-null   int64
 2   Income              5000 non-null   float64
 3   ZIPCode             5000 non-null   category
 4   Family              5000 non-null   int64
 5   CCAvg               5000 non-null   float64
 6   Education           5000 non-null   category
 7   Mortgage            5000 non-null   float64
 8   Personal_Loan       5000 non-null   category
 9   Securities_Account  5000 non-null   category
 10  CD_Account          5000 non-null   category
 11  Online              5000 non-null   category
 12  CreditCard          5000 non-null   category
dtypes: category(7), float64(3), int64(3)
memory usage: 269.7 KB
Data Preparation¶
Creating Training and Test Sets¶
# defining the explanatory (independent) and response (dependent) variables
X = data.drop(["Personal_Loan"], axis=1)
Y = data["Personal_Loan"]
# creating dummy variables
X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)
# specifying the datatype of the independent variables data frame
X = X.astype(float)
# I will use 70% of data for training and 30% for testing.
# Use stratify to preserve class balance
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)
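The effect of `stratify` can be verified on a toy imbalanced target (the 90/10 split below mirrors this dataset's class balance):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target (90% class 0, 10% class 1).
y = pd.Series([0] * 90 + [1] * 10)
X = pd.DataFrame({"x": range(100)})

# With stratify=y, both splits keep the original 90/10 class proportions.
_, _, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1, stratify=y)

print(y_tr.value_counts(normalize=True).to_dict())
print(y_te.value_counts(normalize=True).to_dict())
```

Without `stratify`, a random split of a rare class can leave the test set with a noticeably different positive rate.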
# Preview the training set
x_train.head()
| | ID | Age | Experience | Income | Family | CCAvg | Mortgage | Securities_Account | CD_Account | Online | ... | ZIPCode_96003 | ZIPCode_96008 | ZIPCode_96064 | ZIPCode_96091 | ZIPCode_96094 | ZIPCode_96145 | ZIPCode_96150 | ZIPCode_96651 | Education_2 | Education_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3465 | 3466.0 | 65.0 | 41.0 | 42.0 | 1.0 | 1.9 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4582 | 4583.0 | 25.0 | -1.0 | 69.0 | 3.0 | 0.3 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1922 | 1923.0 | 39.0 | 15.0 | 25.0 | 1.0 | 1.4 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1186 | 1187.0 | 62.0 | 38.0 | 43.0 | 4.0 | 1.2 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3718 | 3719.0 | 45.0 | 19.0 | 8.0 | 2.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 479 columns
To Check the Proportions of the Training/Test Sets from the Split Data¶
print("Shape of Training set : ", x_train.shape)
print("Shape of test set : ", x_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 479)
Shape of test set :  (1500, 479)
Percentage of classes in training set:
0    0.904
1    0.096
Name: Personal_Loan, dtype: float64
Percentage of classes in test set:
0    0.904
1    0.096
Name: Personal_Loan, dtype: float64
Age and Experience have no outliers. Income, Mortgage, and average credit card spending (CCAvg) have upper outliers.
Model Building¶
Model Evaluation Criterion¶
I create one utility function to aggregate all evaluation metrics into a single DataFrame, and another to visualize the confusion matrix.
Model Evaluation¶
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification(model, predictors, target):
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def plot_confusion_matrix(model, predictors, target):
    # Predict the target values using the provided model and predictors
    y_pred = model.predict(predictors)

    # Compute the confusion matrix comparing the true target values with the predicted values
    cm = confusion_matrix(target, y_pred)

    # Create labels for each cell in the confusion matrix with both count and percentage
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)  # reshaping to a matrix

    # Set the figure size for the plot
    plt.figure(figsize=(6, 4))

    # Plot the confusion matrix as a heatmap with the labels
    sns.heatmap(cm, annot=labels, fmt="")

    # Add axis labels
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Decision Tree (sklearn default)¶
# creating an instance of the decision tree model
model1 = DecisionTreeClassifier(random_state=1, criterion="gini") # random_state sets a seed value and enables reproducibility
# fitting the model to the training data
model1.fit(x_train, y_train)
DecisionTreeClassifier(random_state=1)
# Visualizing Confusion Matrix for Model 1
plot_confusion_matrix(model1, x_train, y_train)
model1_train_perf = model_performance_classification(model1, x_train, y_train)
model1_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
plot_confusion_matrix(model1, x_test, y_test)
model1_test_perf = model_performance_classification(model1, x_test, y_test)
model1_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.982 | 0.881944 | 0.927007 | 0.903915 |
- The training performance is perfect (F1 = 1.0), which is unrealistic in real-world scenarios and suggests the model has memorized the training data; the rules below even split on the arbitrary ID column and on individual ZIP-code dummies.
- There is a large gap between the training F1 score (1.00) and the test F1 score (~0.90).
- This indicates that the model is overfitting.
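The diagnosis above can be captured in a small, hypothetical helper (the function name and the 0.05 tolerance are assumptions for illustration, not part of the notebook's evaluation code):

```python
# Hypothetical overfitting check: flag a model when the training F1
# exceeds the test F1 by more than a chosen tolerance.
def is_overfitting(train_f1: float, test_f1: float, tol: float = 0.05) -> bool:
    return (train_f1 - test_f1) > tol

print(is_overfitting(1.0, 0.904))    # True  (model1: perfect train, lower test)
print(is_overfitting(0.918, 0.910))  # False (scores close together)
```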
Visualizing the Decision Tree¶
feature_names = list(x_train.columns)
print(feature_names)
['ID', 'Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_90007', 'ZIPCode_90009', 'ZIPCode_90011', 'ZIPCode_90016', 'ZIPCode_90018', 'ZIPCode_90019', 'ZIPCode_90024', 'ZIPCode_90025', 'ZIPCode_90027', 'ZIPCode_90028', 'ZIPCode_90029', 'ZIPCode_90032', 'ZIPCode_90033', 'ZIPCode_90034', 'ZIPCode_90035', 'ZIPCode_90036', 'ZIPCode_90037', 'ZIPCode_90041', 'ZIPCode_90044', 'ZIPCode_90045', 'ZIPCode_90048', 'ZIPCode_90049', 'ZIPCode_90057', 'ZIPCode_90058', 'ZIPCode_90059', 'ZIPCode_90064', 'ZIPCode_90065', 'ZIPCode_90066', 'ZIPCode_90068', 'ZIPCode_90071', 'ZIPCode_90073', 'ZIPCode_90086', 'ZIPCode_90089', 'ZIPCode_90095', 'ZIPCode_90210', 'ZIPCode_90212', 'ZIPCode_90230', 'ZIPCode_90232', 'ZIPCode_90245', 'ZIPCode_90250', 'ZIPCode_90254', 'ZIPCode_90266', 'ZIPCode_90272', 'ZIPCode_90274', 'ZIPCode_90275', 'ZIPCode_90277', 'ZIPCode_90280', 'ZIPCode_90291', 'ZIPCode_90304', 'ZIPCode_90401', 'ZIPCode_90404', 'ZIPCode_90405', 'ZIPCode_90502', 'ZIPCode_90503', 'ZIPCode_90504', 'ZIPCode_90505', 'ZIPCode_90509', 'ZIPCode_90601', 'ZIPCode_90623', 'ZIPCode_90630', 'ZIPCode_90638', 'ZIPCode_90639', 'ZIPCode_90640', 'ZIPCode_90650', 'ZIPCode_90717', 'ZIPCode_90720', 'ZIPCode_90740', 'ZIPCode_90745', 'ZIPCode_90747', 'ZIPCode_90755', 'ZIPCode_90813', 'ZIPCode_90840', 'ZIPCode_91006', 'ZIPCode_91007', 'ZIPCode_91016', 'ZIPCode_91024', 'ZIPCode_91030', 'ZIPCode_91040', 'ZIPCode_91101', 'ZIPCode_91103', 'ZIPCode_91105', 'ZIPCode_91107', 'ZIPCode_91109', 'ZIPCode_91116', 'ZIPCode_91125', 'ZIPCode_91129', 'ZIPCode_91203', 'ZIPCode_91207', 'ZIPCode_91301', 'ZIPCode_91302', 'ZIPCode_91304', 'ZIPCode_91311', 'ZIPCode_91320', 'ZIPCode_91326', 'ZIPCode_91330', 'ZIPCode_91335', 'ZIPCode_91342', 'ZIPCode_91343', 'ZIPCode_91345', 'ZIPCode_91355', 'ZIPCode_91360', 'ZIPCode_91361', 'ZIPCode_91365', 'ZIPCode_91367', 'ZIPCode_91380', 'ZIPCode_91401', 'ZIPCode_91423', 'ZIPCode_91604', 'ZIPCode_91605', 'ZIPCode_91614', 
'ZIPCode_91706', 'ZIPCode_91709', 'ZIPCode_91710', 'ZIPCode_91711', 'ZIPCode_91730', 'ZIPCode_91741', 'ZIPCode_91745', 'ZIPCode_91754', 'ZIPCode_91763', 'ZIPCode_91765', 'ZIPCode_91768', 'ZIPCode_91770', 'ZIPCode_91773', 'ZIPCode_91775', 'ZIPCode_91784', 'ZIPCode_91791', 'ZIPCode_91801', 'ZIPCode_91902', 'ZIPCode_91910', 'ZIPCode_91911', 'ZIPCode_91941', 'ZIPCode_91942', 'ZIPCode_91950', 'ZIPCode_92007', 'ZIPCode_92008', 'ZIPCode_92009', 'ZIPCode_92024', 'ZIPCode_92028', 'ZIPCode_92029', 'ZIPCode_92037', 'ZIPCode_92038', 'ZIPCode_92054', 'ZIPCode_92056', 'ZIPCode_92064', 'ZIPCode_92068', 'ZIPCode_92069', 'ZIPCode_92084', 'ZIPCode_92093', 'ZIPCode_92096', 'ZIPCode_92101', 'ZIPCode_92103', 'ZIPCode_92104', 'ZIPCode_92106', 'ZIPCode_92109', 'ZIPCode_92110', 'ZIPCode_92115', 'ZIPCode_92116', 'ZIPCode_92120', 'ZIPCode_92121', 'ZIPCode_92122', 'ZIPCode_92123', 'ZIPCode_92124', 'ZIPCode_92126', 'ZIPCode_92129', 'ZIPCode_92130', 'ZIPCode_92131', 'ZIPCode_92152', 'ZIPCode_92154', 'ZIPCode_92161', 'ZIPCode_92173', 'ZIPCode_92177', 'ZIPCode_92182', 'ZIPCode_92192', 'ZIPCode_92220', 'ZIPCode_92251', 'ZIPCode_92325', 'ZIPCode_92333', 'ZIPCode_92346', 'ZIPCode_92350', 'ZIPCode_92354', 'ZIPCode_92373', 'ZIPCode_92374', 'ZIPCode_92399', 'ZIPCode_92407', 'ZIPCode_92507', 'ZIPCode_92518', 'ZIPCode_92521', 'ZIPCode_92606', 'ZIPCode_92612', 'ZIPCode_92614', 'ZIPCode_92624', 'ZIPCode_92626', 'ZIPCode_92630', 'ZIPCode_92634', 'ZIPCode_92646', 'ZIPCode_92647', 'ZIPCode_92648', 'ZIPCode_92653', 'ZIPCode_92660', 'ZIPCode_92661', 'ZIPCode_92672', 'ZIPCode_92673', 'ZIPCode_92675', 'ZIPCode_92677', 'ZIPCode_92691', 'ZIPCode_92692', 'ZIPCode_92694', 'ZIPCode_92697', 'ZIPCode_92703', 'ZIPCode_92704', 'ZIPCode_92705', 'ZIPCode_92709', 'ZIPCode_92717', 'ZIPCode_92735', 'ZIPCode_92780', 'ZIPCode_92806', 'ZIPCode_92807', 'ZIPCode_92821', 'ZIPCode_92831', 'ZIPCode_92833', 'ZIPCode_92834', 'ZIPCode_92835', 'ZIPCode_92843', 'ZIPCode_92866', 'ZIPCode_92867', 'ZIPCode_92868', 'ZIPCode_92870', 
'ZIPCode_92886', 'ZIPCode_93003', 'ZIPCode_93009', 'ZIPCode_93010', 'ZIPCode_93014', 'ZIPCode_93022', 'ZIPCode_93023', 'ZIPCode_93033', 'ZIPCode_93063', 'ZIPCode_93065', 'ZIPCode_93077', 'ZIPCode_93101', 'ZIPCode_93105', 'ZIPCode_93106', 'ZIPCode_93107', 'ZIPCode_93108', 'ZIPCode_93109', 'ZIPCode_93111', 'ZIPCode_93117', 'ZIPCode_93118', 'ZIPCode_93302', 'ZIPCode_93305', 'ZIPCode_93311', 'ZIPCode_93401', 'ZIPCode_93403', 'ZIPCode_93407', 'ZIPCode_93437', 'ZIPCode_93460', 'ZIPCode_93524', 'ZIPCode_93555', 'ZIPCode_93561', 'ZIPCode_93611', 'ZIPCode_93657', 'ZIPCode_93711', 'ZIPCode_93720', 'ZIPCode_93727', 'ZIPCode_93907', 'ZIPCode_93933', 'ZIPCode_93940', 'ZIPCode_93943', 'ZIPCode_93950', 'ZIPCode_93955', 'ZIPCode_94002', 'ZIPCode_94005', 'ZIPCode_94010', 'ZIPCode_94015', 'ZIPCode_94019', 'ZIPCode_94022', 'ZIPCode_94024', 'ZIPCode_94025', 'ZIPCode_94028', 'ZIPCode_94035', 'ZIPCode_94040', 'ZIPCode_94043', 'ZIPCode_94061', 'ZIPCode_94063', 'ZIPCode_94065', 'ZIPCode_94066', 'ZIPCode_94080', 'ZIPCode_94085', 'ZIPCode_94086', 'ZIPCode_94087', 'ZIPCode_94102', 'ZIPCode_94104', 'ZIPCode_94105', 'ZIPCode_94107', 'ZIPCode_94108', 'ZIPCode_94109', 'ZIPCode_94110', 'ZIPCode_94111', 'ZIPCode_94112', 'ZIPCode_94114', 'ZIPCode_94115', 'ZIPCode_94116', 'ZIPCode_94117', 'ZIPCode_94118', 'ZIPCode_94122', 'ZIPCode_94123', 'ZIPCode_94124', 'ZIPCode_94126', 'ZIPCode_94131', 'ZIPCode_94132', 'ZIPCode_94143', 'ZIPCode_94234', 'ZIPCode_94301', 'ZIPCode_94302', 'ZIPCode_94303', 'ZIPCode_94304', 'ZIPCode_94305', 'ZIPCode_94306', 'ZIPCode_94309', 'ZIPCode_94402', 'ZIPCode_94404', 'ZIPCode_94501', 'ZIPCode_94507', 'ZIPCode_94509', 'ZIPCode_94521', 'ZIPCode_94523', 'ZIPCode_94526', 'ZIPCode_94534', 'ZIPCode_94536', 'ZIPCode_94538', 'ZIPCode_94539', 'ZIPCode_94542', 'ZIPCode_94545', 'ZIPCode_94546', 'ZIPCode_94550', 'ZIPCode_94551', 'ZIPCode_94553', 'ZIPCode_94555', 'ZIPCode_94558', 'ZIPCode_94566', 'ZIPCode_94571', 'ZIPCode_94575', 'ZIPCode_94577', 'ZIPCode_94583', 'ZIPCode_94588', 
'ZIPCode_94590', 'ZIPCode_94591', 'ZIPCode_94596', 'ZIPCode_94598', 'ZIPCode_94604', 'ZIPCode_94606', 'ZIPCode_94607', 'ZIPCode_94608', 'ZIPCode_94609', 'ZIPCode_94610', 'ZIPCode_94611', 'ZIPCode_94612', 'ZIPCode_94618', 'ZIPCode_94701', 'ZIPCode_94703', 'ZIPCode_94704', 'ZIPCode_94705', 'ZIPCode_94706', 'ZIPCode_94707', 'ZIPCode_94708', 'ZIPCode_94709', 'ZIPCode_94710', 'ZIPCode_94720', 'ZIPCode_94801', 'ZIPCode_94803', 'ZIPCode_94806', 'ZIPCode_94901', 'ZIPCode_94904', 'ZIPCode_94920', 'ZIPCode_94923', 'ZIPCode_94928', 'ZIPCode_94939', 'ZIPCode_94949', 'ZIPCode_94960', 'ZIPCode_94965', 'ZIPCode_94970', 'ZIPCode_94998', 'ZIPCode_95003', 'ZIPCode_95005', 'ZIPCode_95006', 'ZIPCode_95008', 'ZIPCode_95010', 'ZIPCode_95014', 'ZIPCode_95020', 'ZIPCode_95023', 'ZIPCode_95032', 'ZIPCode_95035', 'ZIPCode_95037', 'ZIPCode_95039', 'ZIPCode_95045', 'ZIPCode_95051', 'ZIPCode_95053', 'ZIPCode_95054', 'ZIPCode_95060', 'ZIPCode_95064', 'ZIPCode_95070', 'ZIPCode_95112', 'ZIPCode_95120', 'ZIPCode_95123', 'ZIPCode_95125', 'ZIPCode_95126', 'ZIPCode_95131', 'ZIPCode_95133', 'ZIPCode_95134', 'ZIPCode_95135', 'ZIPCode_95136', 'ZIPCode_95138', 'ZIPCode_95192', 'ZIPCode_95193', 'ZIPCode_95207', 'ZIPCode_95211', 'ZIPCode_95307', 'ZIPCode_95348', 'ZIPCode_95351', 'ZIPCode_95354', 'ZIPCode_95370', 'ZIPCode_95403', 'ZIPCode_95405', 'ZIPCode_95422', 'ZIPCode_95449', 'ZIPCode_95482', 'ZIPCode_95503', 'ZIPCode_95518', 'ZIPCode_95521', 'ZIPCode_95605', 'ZIPCode_95616', 'ZIPCode_95617', 'ZIPCode_95621', 'ZIPCode_95630', 'ZIPCode_95670', 'ZIPCode_95678', 'ZIPCode_95741', 'ZIPCode_95747', 'ZIPCode_95758', 'ZIPCode_95762', 'ZIPCode_95812', 'ZIPCode_95814', 'ZIPCode_95816', 'ZIPCode_95817', 'ZIPCode_95818', 'ZIPCode_95819', 'ZIPCode_95820', 'ZIPCode_95821', 'ZIPCode_95822', 'ZIPCode_95825', 'ZIPCode_95827', 'ZIPCode_95828', 'ZIPCode_95831', 'ZIPCode_95833', 'ZIPCode_95841', 'ZIPCode_95842', 'ZIPCode_95929', 'ZIPCode_95973', 'ZIPCode_96001', 'ZIPCode_96003', 'ZIPCode_96008', 'ZIPCode_96064', 
'ZIPCode_96091', 'ZIPCode_96094', 'ZIPCode_96145', 'ZIPCode_96150', 'ZIPCode_96651', 'Education_2', 'Education_3']
# list of feature names in x_train
feature_names = list(x_train.columns)
# set the figure size for the plot
plt.figure(figsize=(20, 20))
# plotting the decision tree
out = tree.plot_tree(
    model1,  # decision tree classifier model
    feature_names=feature_names,  # list of feature names (columns) in the dataset
    filled=True,  # fill the nodes with colors based on class
    fontsize=9,  # font size for the node text
    node_ids=False,  # do not show the ID of each node
    class_names=None,  # whether or not to display class names
)
# add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")  # set arrow color to black
        arrow.set_linewidth(1)  # set arrow linewidth to 1
# displaying the plot
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model1, feature_names=feature_names, show_weights=True))
|--- Income <= 104.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2519.00, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- ZIPCode_92122 <= 0.50
|   |   |   |   |   |--- ZIPCode_95039 <= 0.50
|   |   |   |   |   |   |--- ZIPCode_90601 <= 0.50
|   |   |   |   |   |   |   |--- ZIPCode_94122 <= 0.50
|   |   |   |   |   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Age > 26.50
|   |   |   |   |   |   |   |   |   |--- ZIPCode_92220 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- ZIPCode_94709 <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |   |   |--- ZIPCode_94709 > 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- ZIPCode_92220 > 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- ZIPCode_94122 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- ZIPCode_90601 > 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- ZIPCode_95039 > 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- ZIPCode_92122 > 0.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- CCAvg <= 4.40
|   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- CCAvg > 4.40
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |--- Income > 92.50
|   |   |   |--- CCAvg <= 4.45
|   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |--- Experience <= 37.50
|   |   |   |   |   |   |   |--- ZIPCode_90034 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_90034 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Experience > 37.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Education_2 > 0.50
|   |   |   |   |   |   |--- ID <= 4134.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |--- ID > 4134.50
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- Education_3 > 0.50
|   |   |   |   |   |--- ZIPCode_90277 <= 0.50
|   |   |   |   |   |   |--- ZIPCode_94304 <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 8.00] class: 1
|   |   |   |   |   |   |--- ZIPCode_94304 > 0.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_90277 > 0.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- CCAvg > 4.45
|   |   |   |   |--- Mortgage <= 320.00
|   |   |   |   |   |--- Experience <= 31.50
|   |   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |   |   |--- Experience > 31.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Mortgage > 320.00
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|--- Income > 104.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [458.00, 0.00] class: 0
|   |   |   |--- Education_2 > 0.50
|   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |--- CCAvg <= 2.85
|   |   |   |   |   |   |--- Experience <= 4.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Experience > 4.50
|   |   |   |   |   |   |   |--- weights: [8.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 2.85
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |--- Income > 116.50
|   |   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |--- CCAvg > 1.10
|   |   |   |   |   |--- Age <= 33.00
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Age > 33.00
|   |   |   |   |   |   |--- ID <= 1151.00
|   |   |   |   |   |   |   |--- Experience <= 19.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Experience > 19.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- ID > 1151.00
|   |   |   |   |   |   |   |--- ZIPCode_92120 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |   |   |--- ZIPCode_92120 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- Income > 116.50
|   |   |   |   |--- weights: [0.00, 67.00] class: 1
|   |--- Family > 2.50
|   |   |--- Income <= 114.50
|   |   |   |--- Experience <= 3.50
|   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |--- Experience > 3.50
|   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- CCAvg <= 2.90
|   |   |   |   |   |   |   |--- ZIPCode_94596 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_94596 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- CCAvg > 2.90
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |--- ZIPCode_95054 <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 11.00] class: 1
|   |   |   |   |   |   |--- ZIPCode_95054 > 0.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Age > 57.50
|   |   |   |   |   |--- ZIPCode_94606 <= 0.50
|   |   |   |   |   |   |--- weights: [11.00, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_94606 > 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |--- Income > 114.50
|   |   |   |--- weights: [0.00, 155.00] class: 1
Model Performance Improvement¶
Decision Tree (Pre-pruning)¶
# define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 11, 2)
max_leaf_nodes_values = np.arange(10, 51, 10)
min_samples_split_values = np.arange(10, 51, 10)
# initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
# iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:
            # initialize the tree with the current set of parameters
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                random_state=42,
            )
            # fit the model to the training data
            estimator.fit(x_train, y_train)
            # make predictions on the training and test sets
            y_train_pred = estimator.predict(x_train)
            y_test_pred = estimator.predict(x_test)
            # calculate F1 scores for training and test sets
            train_f1_score = f1_score(y_train, y_train_pred)
            test_f1_score = f1_score(y_test, y_test_pred)
            # calculate the absolute difference between training and test F1 scores
            score_diff = abs(train_f1_score - test_f1_score)
            # update the best estimator if the current one has a smaller score difference
            if score_diff < best_score_diff:
                best_score_diff = score_diff
                best_estimator = estimator
# selecting the best model found by the search (already fitted during the search)
model2 = best_estimator
# refitting the best model to the training data
model2.fit(x_train, y_train)
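A caveat on the selection criterion above: minimizing the absolute train-test F1 gap can, in principle, favor a uniformly weak model whose two scores are equally low. A common alternative (an assumption for illustration, not what this notebook does) is to maximize cross-validated F1 with sklearn's `GridSearchCV`. A minimal sketch on synthetic stand-in data, reusing the same parameter ranges:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in data (roughly 90/10, like the loan target).
X_demo, y_demo = make_classification(n_samples=500, weights=[0.9], random_state=1)

# The grid mirrors the ranges iterated over in the manual triple loop above.
param_grid = {
    "max_depth": np.arange(2, 11, 2),
    "max_leaf_nodes": np.arange(10, 51, 10),
    "min_samples_split": np.arange(10, 51, 10),
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1",  # optimize cross-validated F1 instead of the train/test gap
    cv=5,
    n_jobs=-1,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

Cross-validation scores each candidate on held-out folds only, so a degenerate model that fits nothing cannot win merely by having a small train-test gap.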
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=20, min_samples_split=40,
                       random_state=42)
Model Evaluation¶
- Checking performance on training data
plot_confusion_matrix(model2, x_train, y_train)
model2_train_perf = model_performance_classification(model2, x_train, y_train)
model2_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.984857 | 0.883929 | 0.954984 | 0.918083 |
plot_confusion_matrix(model2, x_test, y_test)
model2_test_perf = model_performance_classification(model2, x_test, y_test)
model2_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.983333 | 0.875 | 0.947368 | 0.909747 |
- The training and test scores are very close to each other, indicating a generalized performance.
Visualizing the Decision Tree¶
# list of feature names in x_train
feature_names = list(x_train.columns)
# set the figure size for the plot
plt.figure(figsize=(20, 20))
# plotting the decision tree
out = tree.plot_tree(
    model2,  # decision tree classifier model
    feature_names=feature_names,  # list of feature names (columns) in the dataset
    filled=True,  # fill the nodes with colors based on class
    fontsize=9,  # font size for the node text
    node_ids=False,  # do not show the ID of each node
    class_names=None,  # whether or not to display class names
)
# add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")  # set arrow color to black
        arrow.set_linewidth(1)  # set arrow linewidth to 1
# displaying the plot
plt.show()
- This is a far less complex tree than the previous one.
- We can observe the decision rules much more clearly in this visual.
# printing a text report showing the rules of a decision tree
print(
    tree.export_text(
        model2,  # specify the model
        feature_names=feature_names,  # specify the feature names
        show_weights=True,  # show the weights associated with each leaf
    )
)
|--- Income <= 104.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2519.00, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- ZIPCode_92122 <= 0.50
|   |   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Age > 26.50
|   |   |   |   |   |   |--- weights: [115.00, 8.00] class: 0
|   |   |   |   |--- ZIPCode_92122 > 0.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [1.00, 3.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- CCAvg <= 4.45
|   |   |   |   |--- weights: [13.00, 16.00] class: 1
|   |   |   |--- CCAvg > 4.45
|   |   |   |   |--- weights: [13.00, 2.00] class: 0
|--- Income > 104.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [458.00, 0.00] class: 0
|   |   |   |--- Education_2 > 0.50
|   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |--- weights: [8.00, 7.00] class: 0
|   |   |   |   |--- Income > 116.50
|   |   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- weights: [9.00, 7.00] class: 0
|   |   |   |--- Income > 116.50
|   |   |   |   |--- weights: [0.00, 67.00] class: 1
|   |--- Family > 2.50
|   |   |--- Income <= 114.50
|   |   |   |--- Experience <= 3.50
|   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |--- Experience > 3.50
|   |   |   |   |--- weights: [18.00, 15.00] class: 0
|   |   |--- Income > 114.50
|   |   |   |--- weights: [0.00, 155.00] class: 1
Decision Tree (Post-pruning)¶
# Create an instance of the decision tree model
clf = DecisionTreeClassifier(random_state=42)
# Compute the cost complexity pruning path for the model using the training data
path = clf.cost_complexity_pruning_path(x_train, y_train)
# Extract the array of effective alphas from the pruning path
ccp_alphas = abs(path.ccp_alphas)
# Extract the array of total impurities at each alpha along the pruning path
impurities = path.impurities
pd.DataFrame(path)
| ccp_alphas | impurities | |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000278 | 0.001671 |
| 2 | 0.000381 | 0.002052 |
| 3 | 0.000410 | 0.002871 |
| 4 | 0.000429 | 0.003299 |
| 5 | 0.000457 | 0.004214 |
| 6 | 0.000467 | 0.004680 |
| 7 | 0.000490 | 0.005170 |
| 8 | 0.000495 | 0.006160 |
| 9 | 0.000508 | 0.006668 |
| 10 | 0.000512 | 0.010255 |
| 11 | 0.000514 | 0.010769 |
| 12 | 0.000524 | 0.011293 |
| 13 | 0.000524 | 0.011817 |
| 14 | 0.000583 | 0.012400 |
| 15 | 0.000653 | 0.013053 |
| 16 | 0.000667 | 0.015723 |
| 17 | 0.000989 | 0.016712 |
| 18 | 0.000994 | 0.017706 |
| 19 | 0.001000 | 0.018706 |
| 20 | 0.001195 | 0.021097 |
| 21 | 0.001625 | 0.022723 |
| 22 | 0.001782 | 0.024505 |
| 23 | 0.001908 | 0.026413 |
| 24 | 0.002335 | 0.028748 |
| 25 | 0.002970 | 0.031718 |
| 26 | 0.008156 | 0.039874 |
| 27 | 0.025722 | 0.091318 |
| 28 | 0.034690 | 0.126007 |
| 29 | 0.047561 | 0.173568 |
# Create a figure
fig, ax = plt.subplots(figsize=(14, 7))
# Plot the total impurities versus effective alphas, excluding the last value,
# using markers at each data point and connecting them with steps
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
# Set the x-axis label
ax.set_xlabel("Effective Alpha")
# Set the y-axis label
ax.set_ylabel("Total impurity of leaves")
# Set the title of the plot
ax.set_title("Total Impurity vs Effective Alpha for training set");
Training Decision Tree using the Effective Alphas¶
I train a decision tree using a range of effective alpha values. The final entry in ccp_alphas represents the pruning level that collapses the entire tree, resulting in clfs[-1], a tree with only a single node.
# Initialize an empty list to store the decision tree classifiers
clfs = []
# Iterate over each ccp_alpha value extracted from cost complexity pruning path
for ccp_alpha in ccp_alphas:
    # Create an instance of the DecisionTreeClassifier
    clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42)
    # Fit the classifier to the training data
    clf.fit(x_train, y_train)
    # Append the trained classifier to the list
    clfs.append(clf)

# Print the number of nodes in the last tree along with its ccp_alpha value
print(
    "Number of nodes in the last tree is {} with ccp_alpha {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is 1 with ccp_alpha 0.04756053380018527
# We need to remove the last element in clfs and ccp_alphas as it corresponds to a trivial tree with only one node
# Remove the last classifier and corresponding ccp_alpha value from the lists
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
# Extract the number of nodes in each tree classifier
node_counts = [clf.tree_.node_count for clf in clfs]
# Extract the maximum depth of each tree classifier
depth = [clf.tree_.max_depth for clf in clfs]
# Create a figure and a set of subplots
fig, ax = plt.subplots(2, 1, figsize=(14, 12))
# Plot the number of nodes versus ccp_alphas on the first subplot
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("Alpha")
ax[0].set_ylabel("Number of nodes")
ax[0].set_title("Number of nodes vs Alpha")
# Plot the depth of tree versus ccp_alphas on the second subplot
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("Alpha")
ax[1].set_ylabel("Depth of tree")
ax[1].set_title("Depth vs Alpha")
# Adjust the layout of the subplots to avoid overlap
fig.tight_layout()
# Initialize an empty list to store F1 scores on the training set for each decision tree classifier
train_f1_scores = []

# Iterate through each decision tree classifier in 'clfs'
for clf in clfs:
    # Predict labels for the training set using the current decision tree classifier
    pred_train = clf.predict(x_train)
    # Calculate the F1 score for the training set predictions compared to true labels
    f1_train = f1_score(y_train, pred_train)
    # Append the calculated F1 score to the train_f1_scores list
    train_f1_scores.append(f1_train)

# Initialize an empty list to store F1 scores on the test set for each decision tree classifier
test_f1_scores = []

# Iterate through each decision tree classifier in 'clfs'
for clf in clfs:
    # Predict labels for the test set using the current decision tree classifier
    pred_test = clf.predict(x_test)
    # Calculate the F1 score for the test set predictions compared to true labels
    f1_test = f1_score(y_test, pred_test)
    # Append the calculated F1 score to the test_f1_scores list
    test_f1_scores.append(f1_test)
# Create a figure
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("Alpha") # Set the label for the x-axis
ax.set_ylabel("F1 Score") # Set the label for the y-axis
ax.set_title("F1 Score vs Alpha for training and test sets") # Set the title of the plot
# Plot the training F1 scores against alpha, using circles as markers and steps-post style
ax.plot(ccp_alphas, train_f1_scores, marker="o", label="training", drawstyle="steps-post")
# Plot the testing F1 scores against alpha, using circles as markers and steps-post style
ax.plot(ccp_alphas, test_f1_scores, marker="o", label="test", drawstyle="steps-post")
ax.legend(); # Add a legend to the plot
# finding the model with the highest test F1 score
index_best_model = np.argmax(test_f1_scores)
# selecting the decision tree model corresponding to the highest test score
model3 = clfs[index_best_model]
print(model3)
DecisionTreeClassifier(ccp_alpha=0.0006674876847290641, random_state=42)
Model Evaluation¶
plot_confusion_matrix(model3, x_train, y_train)
model3_train_perf = model_performance_classification(model3, x_train, y_train)
model3_train_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.988857 | 0.928571 | 0.954128 | 0.941176 |
plot_confusion_matrix(model3, x_test, y_test)
model3_test_perf = model_performance_classification(
model3, x_test, y_test
)
model3_test_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.984 | 0.881944 | 0.947761 | 0.913669 |
- The test F1 score (0.9137) is only slightly below the training F1 score (0.9412); the small gap indicates the pruned tree generalizes well without overfitting.
Decision Tree Visualization¶
# list of feature names in x_train
feature_names = list(x_train.columns)
# set the figure size for the plot
plt.figure(figsize=(14, 8))
# plotting the decision tree
out = tree.plot_tree(
model3, # decision tree classifier model
feature_names=feature_names, # list of feature names (columns) in the dataset
filled=True, # fill the nodes with colors based on class
fontsize=9, # font size for the node text
node_ids=False, # do not show the ID of each node
    class_names=None, # do not display class names at the leaves
)
# add arrows to the decision tree splits if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black") # set arrow color to black
arrow.set_linewidth(1) # set arrow linewidth to 1
# displaying the plot
plt.show()
- This is a far less complex tree than the previous two.
# printing a text report showing the rules of a decision tree
print(
tree.export_text(
model3, # specify the model
feature_names=feature_names, # specify the feature names
show_weights=True # specify whether or not to show the weights associated with the model
)
)
|--- Income <= 104.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2519.00, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- weights: [115.00, 10.00] class: 0
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [1.00, 3.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- CCAvg <= 4.45
|   |   |   |   |--- weights: [13.00, 16.00] class: 1
|   |   |   |--- CCAvg > 4.45
|   |   |   |   |--- weights: [13.00, 2.00] class: 0
|--- Income > 104.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [458.00, 0.00] class: 0
|   |   |   |--- Education_2 > 0.50
|   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |--- CCAvg <= 2.85
|   |   |   |   |   |   |--- weights: [8.00, 1.00] class: 0
|   |   |   |   |   |--- CCAvg > 2.85
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |--- Income > 116.50
|   |   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- weights: [9.00, 7.00] class: 0
|   |   |   |--- Income > 116.50
|   |   |   |   |--- weights: [0.00, 67.00] class: 1
|   |--- Family > 2.50
|   |   |--- Income <= 114.50
|   |   |   |--- Experience <= 3.50
|   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |--- Experience > 3.50
|   |   |   |   |--- Experience <= 31.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- weights: [6.00, 3.00] class: 0
|   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |--- weights: [1.00, 11.00] class: 1
|   |   |   |   |--- Experience > 31.50
|   |   |   |   |   |--- weights: [11.00, 1.00] class: 0
|   |   |--- Income > 114.50
|   |   |   |--- weights: [0.00, 155.00] class: 1
Income is by far the most influential factor in predicting whether a liability customer will purchase a personal loan, contributing the highest to the model's decisions.
Family size ranks second, indicating that customers with more dependents or larger households are more likely to consider personal loans.
Education levels, specifically Education_2 and Education_3, are also strong predictors. This shows that educational attainment plays a notable role in loan adoption.
Features like CCAvg (average credit card spend) and Age show moderate importance, suggesting their influence is present but secondary.
A large number of features (e.g., ZIPCode_*, Mortgage, CreditCard, Securities_Account) show very low to negligible importance, implying they contribute little to the model's predictive capability.
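The rule dump above translates directly into plain conditional logic, which is one practical use of tree rule extraction. As an illustration only (a sketch of the top branch, using the thresholds from the printed rules; the fitted `model3` remains the source of truth):

```python
def predict_loan_top_branch(income, cc_avg):
    """Hand-coded version of the tree's first branch: Income <= 104.5
    and CCAvg <= 2.95 lands in a pure class-0 leaf (2519 training
    samples, none of whom took the loan). Illustrative only."""
    if income <= 104.5 and cc_avg <= 2.95:
        return 0  # will not buy a personal loan
    return None  # deeper branches of the tree apply

print(predict_loan_top_branch(80, 1.5))  # prints 0
```

Rules like this are easy for the marketing team to audit and apply in a spreadsheet or CRM filter without deploying the model itself.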
Model Performance Comparison and Final Model Selection¶
# training performance comparison
models_train_comp_df = pd.concat(
[
model1_train_perf.T,
model2_train_perf.T,
model3_train_perf.T
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree (sklearn default)",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Decision Tree (sklearn default) | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) | |
|---|---|---|---|
| Accuracy | 1.0 | 0.984857 | 0.988857 |
| Recall | 1.0 | 0.883929 | 0.928571 |
| Precision | 1.0 | 0.954984 | 0.954128 |
| F1 | 1.0 | 0.918083 | 0.941176 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
model1_test_perf.T,
model2_test_perf.T,
model3_test_perf.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree (sklearn default)",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| Decision Tree (sklearn default) | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) | |
|---|---|---|---|
| Accuracy | 0.982000 | 0.983333 | 0.984000 |
| Recall | 0.881944 | 0.875000 | 0.881944 |
| Precision | 0.927007 | 0.947368 | 0.947761 |
| F1 | 0.903915 | 0.909747 | 0.913669 |
- Post-pruning outperforms the default decision tree on every test metric; pre-pruning improves accuracy, precision, and F1 but gives up a little recall.
- The default decision tree achieved a test accuracy of 98.20%, pre-pruning 98.33%, and post-pruning 98.40%, a gain of +0.20 percentage points over the default.
- Post-pruning ties the default for the highest recall (88.19%), so it catches just as many true positives (loan buyers) while making fewer false-positive calls.
- Precision is high across all models and peaks with post-pruning (94.78%), marginally ahead of pre-pruning (94.74%).
- F1 is highest for post-pruning (91.37%), indicating the best trade-off between false positives and false negatives.
Predicting on a Single Data Point¶
%%time
# choosing a data point
applicant_details = x_test.iloc[:1, :]
# making a prediction
approval_prediction = model2.predict(applicant_details)
print(approval_prediction)
[1]
CPU times: user 5.37 ms, sys: 0 ns, total: 5.37 ms
Wall time: 5.61 ms
- The output [1] means the decision tree model (model2) predicted that this applicant will be approved for a personal loan.
- The prediction takes only about 6 milliseconds of wall time end-to-end, which is expected for decision tree inference.
- Instead of predicting a class (approve/reject), the model can also predict the likelihood of approval.
# making a prediction
approval_likelihood = model2.predict_proba(applicant_details)
print(approval_likelihood[0, 1])
1.0
- A value of 1.0 indicates the model is fully confident that this liability customer will buy a personal loan.
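For a decision tree, `predict_proba` is not a calibrated confidence: it is the class fraction of the training samples in the leaf the input reaches, so a pure leaf yields exactly 0.0 or 1.0. A minimal sketch on toy data (not the bank dataset):

```python
# Sketch: a decision tree's predict_proba equals the class fractions of
# training samples in the reached leaf; toy data used for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
proba = toy_clf.predict_proba([[11.0]])
print(proba[0, 1])  # the reached leaf contains only class-1 samples -> 1.0
```

This is why the 1.0 above should be read as "every training customer in this leaf bought the loan," not as a guaranteed outcome for the new applicant.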
Feature Importance Analysis¶
# feature importances from the trained default decision tree (model1) and the x_train column names
importances = model1.feature_importances_
features = x_train.columns
# Create a DataFrame for visualization
importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
# Plot
top_n = 20
top_features = importance_df.head(top_n)
plt.figure(figsize=(10, 6))
plt.barh(top_features['Feature'], top_features['Importance'])
plt.xlabel('Feature Importance')
plt.title('Top 20 Most Significant Customer Attributes')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
- Income – Most significant by far; customers with higher income are more likely to buy personal loans.
- Family – Plays a major role; likely due to family size influencing financial needs.
- Education_2 & Education_3 – Education level significantly affects loan purchasing likelihood.
- CCAvg (Credit Card Average Spending) – Medium importance; suggests spending habits affect loan decisions.
- Age, ID, Experience – Small contribution; may help but not strong indicators alone.
- CD_Account, Mortgage, Online, ZIP codes, Securities_Account, CreditCard – Very low importance, possibly irrelevant in prediction.
- ZIP Codes (especially ZIPCode_91–96) contribute almost nothing individually.
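Because one-hot encoding fragments ZIP code across many dummy columns, the individual `ZIPCode_*` importances understate the attribute as a whole. One way to check (a sketch, with toy values standing in for the `importance_df` built above) is to sum importances by column prefix:

```python
# Sketch: aggregate one-hot-encoded importances by prefix so fragmented
# dummies (e.g. ZIPCode_*) can be compared against single columns.
import pandas as pd

importance_df = pd.DataFrame({  # toy values for illustration
    "Feature": ["Income", "ZIPCode_91", "ZIPCode_92", "Education_2"],
    "Importance": [0.60, 0.01, 0.02, 0.10],
})

grouped = (
    importance_df.assign(Group=importance_df["Feature"].str.split("_").str[0])
    .groupby("Group")["Importance"].sum()
    .sort_values(ascending=False)
)
print(grouped)
```

Even after grouping, ZIP code would need to rival Income or Family to change the conclusion; here it remains a minor contributor.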
Decision Tree Rule Extraction¶
# printing the decision rules only (no weights), reusing the feature names defined above
print(export_text(model3, feature_names=feature_names))
|--- Income <= 104.50
|   |--- CCAvg <= 2.95
|   |   |--- class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- class: 0
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- class: 1
|   |   |--- Income > 92.50
|   |   |   |--- CCAvg <= 4.45
|   |   |   |   |--- class: 1
|   |   |   |--- CCAvg > 4.45
|   |   |   |   |--- class: 0
|--- Income > 104.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- class: 0
|   |   |   |--- Education_2 > 0.50
|   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |--- CCAvg <= 2.85
|   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |--- CCAvg > 2.85
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |--- Income > 116.50
|   |   |   |   |   |--- class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- class: 0
|   |   |   |--- Income > 116.50
|   |   |   |   |--- class: 1
|   |--- Family > 2.50
|   |   |--- Income <= 114.50
|   |   |   |--- Experience <= 3.50
|   |   |   |   |--- class: 0
|   |   |   |--- Experience > 3.50
|   |   |   |   |--- Experience <= 31.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |--- Experience > 31.50
|   |   |   |   |   |--- class: 0
|   |   |--- Income > 114.50
|   |   |   |--- class: 1
Actionable Insights and Business Recommendations¶
What recommendations would you suggest to the bank? My suggestions are as follows:
- Target high-income, graduate customers with high card spend.
- Focus on online, mortgage-holding, and certificate of deposit account users.
- Monitor model and campaign performance.
- The bank should put more efforts in running campaigns digitally.
In response to the objectives stated in the problem statement¶
1. Will a liability customer buy a personal loan?
- The decision tree models — particularly the Post-Pruning Decision Tree — predict this with high accuracy (98.4%) on the test set, indicating that the model performs well in identifying potential buyers.
2. Which customer attributes are most significant? The following attributes are significant:
- Income
- Family Size
- Education Level
- Average credit card spend, CCAvg (moderate)
- Age (moderate)
3. Which customer segment should be targeted? Using the tree rule, the best target segments are:
- Moderate income (92.5k to 104.5k USD) with monthly card spend between 2.95k and 4.45k USD.
- Lower income (up to 92.5k USD) with card spend above 2.95k USD and an existing CD account.
- High income (104.5k to 116.5k USD), graduate education, small family, and card spend above 2.85k USD.
- Very high income (> 116.5k USD), graduate education.
- High income + larger families.
- Customers with existing liabilities but good credit
- Users with online banking activity (indicates tech-savviness and engagement)
- Customers with mortgage but no personal loan (cross-sell opportunity)
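The segment rules above can be applied directly to a customer table to size a campaign target list. A sketch, assuming a DataFrame with the data dictionary's column names (`Income`, `CCAvg`), with toy rows and two example rules from the tree:

```python
# Sketch: turn tree-derived segment rules into a pandas filter to size
# a campaign target list; toy rows stand in for the real customer table.
import pandas as pd

customers = pd.DataFrame({
    "Income": [120, 95, 60, 130],
    "CCAvg": [3.0, 4.0, 1.0, 0.5],
})

# two example segments from the tree: very high income, or moderate
# income with monthly card spend in the 2.95k-4.45k USD band
target = customers[
    (customers["Income"] > 116.5)
    | (customers["Income"].between(92.5, 104.5) & customers["CCAvg"].between(2.95, 4.45))
]
print(len(target))  # prints 3
```

The resulting count (and conversion rate within it, once known) gives the marketing team a concrete estimate of campaign reach before any outreach begins.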